This notebook works on the Home Credit default risk problem and its data. The data and competition details are available on Kaggle: https://www.kaggle.com/competitions/home-credit-default-risk/code.
The data describe an individual's characteristics: income, occupation, family size, and so on. There is a train file and a test file with similar variables. There is also data about an individual's transactions, balances, and other financial information.
Home Credit wants to give customers a good experience by predicting more accurately which applicants will be able to repay a loan and should be approved, and which will be unable to repay and should be rejected.
We will build a supervised classification model that predicts whether someone should be approved for a loan (1 means should not be approved, 0 means should be approved). We will use the Kaggle data about the individual, their transactions, and other financial information. We'll explore the data, clean it, create visualizations, and perform feature engineering to maximize the effectiveness of our model.
#Questions
What data is not needed and can be removed?
What variables have a high correlation with the target variable?
What occupations have the highest number of defaults?
What income types have the highest number of defaults?
What is the target variable?
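As a minimal sketch of how the occupation and income-type questions could be answered, the block below counts defaults per group with base R. The `toy` data frame and its values are made up for illustration and only mimic the `OCCUPATION_TYPE` and `TARGET` columns; the same pattern should carry over to `app_train` once it is loaded.

```r
# Toy stand-in for app_train; the rows are invented for illustration only.
toy <- data.frame(
  OCCUPATION_TYPE = c("Laborers", "Laborers", "Laborers", "Drivers", "Drivers", "Managers"),
  TARGET = factor(c(1, 1, 0, 1, 0, 0))
)

# TARGET is a factor, so compare against the label "1" to flag defaults.
is_default <- toy$TARGET == "1"

# Count defaults and compute the default rate within each occupation.
n_default <- tapply(is_default, toy$OCCUPATION_TYPE, sum)
default_rate <- tapply(is_default, toy$OCCUPATION_TYPE, mean)

default_by_occupation <- data.frame(
  occupation = names(n_default),
  n_default = as.vector(n_default),
  default_rate = as.vector(default_rate)
)

# Put the occupations with the most defaults first.
default_by_occupation <- default_by_occupation[order(-default_by_occupation$n_default), ]
print(default_by_occupation)
```

Reporting the rate next to the raw count is a deliberate choice here: large groups can lead on counts simply because they contain more applicants.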
#Load packages
library(e1071)
library(psych)
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Loading required package: lattice
library(rminer)
library(rmarkdown)
library(tictoc)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(C50)
library(matrixStats)
##
## Attaching package: 'matrixStats'
##
## The following object is masked from 'package:dplyr':
##
## count
library(knitr)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(xgboost)
##
## Attaching package: 'xgboost'
##
## The following object is masked from 'package:dplyr':
##
## slice
library(DataExplorer)
tic()
# Keep the current working directory; the CSV files below are read relative to it
cloud_wd <- getwd()
setwd(cloud_wd)
# Read the data into data frames and set strings as factors
app_test <- read.csv(file = "application_test.csv", stringsAsFactors = TRUE)
app_train <- read.csv(file = "application_train.csv", stringsAsFactors = TRUE)
bur_bal <- read.csv(file = "bureau_balance.csv", stringsAsFactors = TRUE)
bur <- read.csv(file = "bureau.csv", stringsAsFactors = TRUE)
cc_bal <- read.csv(file = "credit_card_balance.csv", stringsAsFactors = TRUE)
inst_pay <- read.csv(file = "installments_payments.csv", stringsAsFactors = TRUE)
pos <- read.csv(file = "POS_CASH_balance.csv", stringsAsFactors = TRUE)
pre_app <- read.csv(file = "previous_application.csv", stringsAsFactors = TRUE)
# Convert TARGET to a factor so that models treat it as a class label
app_train <- app_train %>% mutate(TARGET = factor(TARGET))
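Reading with `stringsAsFactors = TRUE` turns blank text cells into an empty-string factor level rather than `NA`. One possible alternative, shown here as a sketch on a small inline CSV (the three rows are invented), is to pass `na.strings = c("", "NA")` to `read.csv()` so blanks are read as missing values. Whether that is preferable for the real files is a judgment call.

```r
# A tiny inline CSV with one blank OCCUPATION_TYPE cell (made-up rows).
csv_text <- "SK_ID_CURR,OCCUPATION_TYPE,TARGET
1,Laborers,0
2,,1
3,Drivers,0"

# Default behaviour: the blank cell becomes an empty-string factor level.
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
print(levels(raw$OCCUPATION_TYPE))

# With na.strings, the blank cell is read as NA and no "" level is created.
clean <- read.csv(text = csv_text, stringsAsFactors = TRUE, na.strings = c("", "NA"))
print(levels(clean$OCCUPATION_TYPE))
print(sum(is.na(clean$OCCUPATION_TYPE)))
```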
# Train and test data structure
str(app_train)
## 'data.frame': 307511 obs. of 122 variables:
## $ SK_ID_CURR : int 100002 100003 100004 100006 100007 100008 100009 100010 100011 100012 ...
## $ TARGET : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
## $ NAME_CONTRACT_TYPE : Factor w/ 2 levels "Cash loans","Revolving loans": 1 1 2 1 1 1 1 1 1 2 ...
## $ CODE_GENDER : Factor w/ 3 levels "F","M","XNA": 2 1 2 1 2 2 1 2 1 2 ...
## $ FLAG_OWN_CAR : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 2 2 1 1 ...
## $ FLAG_OWN_REALTY : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 2 2 2 ...
## $ CNT_CHILDREN : int 0 0 0 0 0 0 1 0 0 0 ...
## $ AMT_INCOME_TOTAL : num 202500 270000 67500 135000 121500 ...
## $ AMT_CREDIT : num 406598 1293502 135000 312682 513000 ...
## $ AMT_ANNUITY : num 24700 35698 6750 29686 21866 ...
## $ AMT_GOODS_PRICE : num 351000 1129500 135000 297000 513000 ...
## $ NAME_TYPE_SUITE : Factor w/ 8 levels "","Children",..: 8 3 8 8 8 7 8 8 2 8 ...
## $ NAME_INCOME_TYPE : Factor w/ 8 levels "Businessman",..: 8 5 8 8 8 5 2 5 4 8 ...
## $ NAME_EDUCATION_TYPE : Factor w/ 5 levels "Academic degree",..: 5 2 5 5 5 5 2 2 5 5 ...
## $ NAME_FAMILY_STATUS : Factor w/ 6 levels "Civil marriage",..: 4 2 4 1 4 2 2 2 2 4 ...
## $ NAME_HOUSING_TYPE : Factor w/ 6 levels "Co-op apartment",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ REGION_POPULATION_RELATIVE : num 0.0188 0.00354 0.01003 0.00802 0.02866 ...
## $ DAYS_BIRTH : int -9461 -16765 -19046 -19005 -19932 -16941 -13778 -18850 -20099 -14469 ...
## $ DAYS_EMPLOYED : int -637 -1188 -225 -3039 -3038 -1588 -3130 -449 365243 -2019 ...
## $ DAYS_REGISTRATION : num -3648 -1186 -4260 -9833 -4311 ...
## $ DAYS_ID_PUBLISH : int -2120 -291 -2531 -2437 -3458 -477 -619 -2379 -3514 -3992 ...
## $ OWN_CAR_AGE : num NA NA 26 NA NA NA 17 8 NA NA ...
## $ FLAG_MOBIL : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_EMP_PHONE : int 1 1 1 1 1 1 1 1 0 1 ...
## $ FLAG_WORK_PHONE : int 0 0 1 0 0 1 0 1 0 0 ...
## $ FLAG_CONT_MOBILE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_PHONE : int 1 1 1 0 0 1 1 0 0 0 ...
## $ FLAG_EMAIL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ OCCUPATION_TYPE : Factor w/ 19 levels "","Accountants",..: 10 5 10 10 5 10 2 12 1 10 ...
## $ CNT_FAM_MEMBERS : num 1 2 1 2 1 2 3 2 2 1 ...
## $ REGION_RATING_CLIENT : int 2 1 2 2 2 2 2 3 2 2 ...
## $ REGION_RATING_CLIENT_W_CITY : int 2 1 2 2 2 2 2 3 2 2 ...
## $ WEEKDAY_APPR_PROCESS_START : Factor w/ 7 levels "FRIDAY","MONDAY",..: 7 2 2 7 5 7 4 2 7 5 ...
## $ HOUR_APPR_PROCESS_START : int 10 11 9 17 11 16 16 16 14 8 ...
## $ REG_REGION_NOT_LIVE_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_REGION_NOT_WORK_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LIVE_REGION_NOT_WORK_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_CITY_NOT_LIVE_CITY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_CITY_NOT_WORK_CITY : int 0 0 0 0 1 0 0 1 0 0 ...
## $ LIVE_CITY_NOT_WORK_CITY : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ORGANIZATION_TYPE : Factor w/ 58 levels "Advertising",..: 6 40 12 6 38 34 6 34 58 10 ...
## $ EXT_SOURCE_1 : num 0.083 0.311 NA NA NA ...
## $ EXT_SOURCE_2 : num 0.263 0.622 0.556 0.65 0.323 ...
## $ EXT_SOURCE_3 : num 0.139 NA 0.73 NA NA ...
## $ APARTMENTS_AVG : num 0.0247 0.0959 NA NA NA NA NA NA NA NA ...
## $ BASEMENTAREA_AVG : num 0.0369 0.0529 NA NA NA NA NA NA NA NA ...
## $ YEARS_BEGINEXPLUATATION_AVG : num 0.972 0.985 NA NA NA ...
## $ YEARS_BUILD_AVG : num 0.619 0.796 NA NA NA ...
## $ COMMONAREA_AVG : num 0.0143 0.0605 NA NA NA NA NA NA NA NA ...
## $ ELEVATORS_AVG : num 0 0.08 NA NA NA NA NA NA NA NA ...
## $ ENTRANCES_AVG : num 0.069 0.0345 NA NA NA NA NA NA NA NA ...
## $ FLOORSMAX_AVG : num 0.0833 0.2917 NA NA NA ...
## $ FLOORSMIN_AVG : num 0.125 0.333 NA NA NA ...
## $ LANDAREA_AVG : num 0.0369 0.013 NA NA NA NA NA NA NA NA ...
## $ LIVINGAPARTMENTS_AVG : num 0.0202 0.0773 NA NA NA NA NA NA NA NA ...
## $ LIVINGAREA_AVG : num 0.019 0.0549 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAPARTMENTS_AVG : num 0 0.0039 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAREA_AVG : num 0 0.0098 NA NA NA NA NA NA NA NA ...
## $ APARTMENTS_MODE : num 0.0252 0.0924 NA NA NA NA NA NA NA NA ...
## $ BASEMENTAREA_MODE : num 0.0383 0.0538 NA NA NA NA NA NA NA NA ...
## $ YEARS_BEGINEXPLUATATION_MODE: num 0.972 0.985 NA NA NA ...
## $ YEARS_BUILD_MODE : num 0.634 0.804 NA NA NA ...
## $ COMMONAREA_MODE : num 0.0144 0.0497 NA NA NA NA NA NA NA NA ...
## $ ELEVATORS_MODE : num 0 0.0806 NA NA NA NA NA NA NA NA ...
## $ ENTRANCES_MODE : num 0.069 0.0345 NA NA NA NA NA NA NA NA ...
## $ FLOORSMAX_MODE : num 0.0833 0.2917 NA NA NA ...
## $ FLOORSMIN_MODE : num 0.125 0.333 NA NA NA ...
## $ LANDAREA_MODE : num 0.0377 0.0128 NA NA NA NA NA NA NA NA ...
## $ LIVINGAPARTMENTS_MODE : num 0.022 0.079 NA NA NA NA NA NA NA NA ...
## $ LIVINGAREA_MODE : num 0.0198 0.0554 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAPARTMENTS_MODE : num 0 0 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAREA_MODE : num 0 0 NA NA NA NA NA NA NA NA ...
## $ APARTMENTS_MEDI : num 0.025 0.0968 NA NA NA NA NA NA NA NA ...
## $ BASEMENTAREA_MEDI : num 0.0369 0.0529 NA NA NA NA NA NA NA NA ...
## $ YEARS_BEGINEXPLUATATION_MEDI: num 0.972 0.985 NA NA NA ...
## $ YEARS_BUILD_MEDI : num 0.624 0.799 NA NA NA ...
## $ COMMONAREA_MEDI : num 0.0144 0.0608 NA NA NA NA NA NA NA NA ...
## $ ELEVATORS_MEDI : num 0 0.08 NA NA NA NA NA NA NA NA ...
## $ ENTRANCES_MEDI : num 0.069 0.0345 NA NA NA NA NA NA NA NA ...
## $ FLOORSMAX_MEDI : num 0.0833 0.2917 NA NA NA ...
## $ FLOORSMIN_MEDI : num 0.125 0.333 NA NA NA ...
## $ LANDAREA_MEDI : num 0.0375 0.0132 NA NA NA NA NA NA NA NA ...
## $ LIVINGAPARTMENTS_MEDI : num 0.0205 0.0787 NA NA NA NA NA NA NA NA ...
## $ LIVINGAREA_MEDI : num 0.0193 0.0558 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAPARTMENTS_MEDI : num 0 0.0039 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAREA_MEDI : num 0 0.01 NA NA NA NA NA NA NA NA ...
## $ FONDKAPREMONT_MODE : Factor w/ 5 levels "","not specified",..: 4 4 1 1 1 1 1 1 1 1 ...
## $ HOUSETYPE_MODE : Factor w/ 4 levels "","block of flats",..: 2 2 1 1 1 1 1 1 1 1 ...
## $ TOTALAREA_MODE : num 0.0149 0.0714 NA NA NA NA NA NA NA NA ...
## $ WALLSMATERIAL_MODE : Factor w/ 8 levels "","Block","Mixed",..: 7 2 1 1 1 1 1 1 1 1 ...
## $ EMERGENCYSTATE_MODE : Factor w/ 3 levels "","No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
## $ OBS_30_CNT_SOCIAL_CIRCLE : num 2 1 0 2 0 0 1 2 1 2 ...
## $ DEF_30_CNT_SOCIAL_CIRCLE : num 2 0 0 0 0 0 0 0 0 0 ...
## $ OBS_60_CNT_SOCIAL_CIRCLE : num 2 1 0 2 0 0 1 2 1 2 ...
## $ DEF_60_CNT_SOCIAL_CIRCLE : num 2 0 0 0 0 0 0 0 0 0 ...
## $ DAYS_LAST_PHONE_CHANGE : num -1134 -828 -815 -617 -1106 ...
## $ FLAG_DOCUMENT_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FLAG_DOCUMENT_3 : int 1 1 0 1 0 1 0 1 1 0 ...
## $ FLAG_DOCUMENT_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
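In the structure above, `DAYS_EMPLOYED` is negative for most rows but contains the value 365243, which is roughly 1,000 years and looks like a placeholder rather than a real duration. The sketch below uses the ten values printed by `str()` and assumes that 365243 should be treated as missing; the source does not confirm what the value encodes, so the indicator column is kept in case it carries signal.

```r
# The first ten DAYS_EMPLOYED values as printed by str(app_train).
days_employed <- c(-637, -1188, -225, -3039, -3038, -1588, -3130, -449, 365243, -2019)

# Assumption: 365243 is a sentinel value, not a real number of days.
sentinel <- 365243

# Keep a flag for the sentinel rows, then replace the sentinel with NA.
employed_flagged <- days_employed == sentinel
days_employed_clean <- ifelse(employed_flagged, NA_real_, days_employed)

# The sentinel drags the mean far above zero; without it the mean is negative.
print(mean(days_employed))
print(mean(days_employed_clean, na.rm = TRUE))
```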
str(app_test)
## 'data.frame': 48744 obs. of 121 variables:
## $ SK_ID_CURR : int 100001 100005 100013 100028 100038 100042 100057 100065 100066 100067 ...
## $ NAME_CONTRACT_TYPE : Factor w/ 2 levels "Cash loans","Revolving loans": 1 1 1 1 1 1 1 1 1 1 ...
## $ CODE_GENDER : Factor w/ 2 levels "F","M": 1 2 2 1 2 1 2 2 1 1 ...
## $ FLAG_OWN_CAR : Factor w/ 2 levels "N","Y": 1 1 2 1 2 2 2 1 1 2 ...
## $ FLAG_OWN_REALTY : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 2 2 2 ...
## $ CNT_CHILDREN : int 0 0 0 2 1 0 2 0 0 1 ...
## $ AMT_INCOME_TOTAL : num 135000 99000 202500 315000 180000 ...
## $ AMT_CREDIT : num 568800 222768 663264 1575000 625500 ...
## $ AMT_ANNUITY : num 20560 17370 69777 49018 32067 ...
## $ AMT_GOODS_PRICE : num 450000 180000 630000 1575000 625500 ...
## $ NAME_TYPE_SUITE : Factor w/ 8 levels "","Children",..: 8 8 1 8 8 8 8 8 8 3 ...
## $ NAME_INCOME_TYPE : Factor w/ 7 levels "Businessman",..: 7 7 7 7 7 4 7 7 4 7 ...
## $ NAME_EDUCATION_TYPE : Factor w/ 5 levels "Academic degree",..: 2 5 2 5 5 5 2 2 2 2 ...
## $ NAME_FAMILY_STATUS : Factor w/ 5 levels "Civil marriage",..: 2 2 2 2 2 2 2 4 2 1 ...
## $ NAME_HOUSING_TYPE : Factor w/ 6 levels "Co-op apartment",..: 2 2 2 2 2 2 2 6 2 2 ...
## $ REGION_POPULATION_RELATIVE : num 0.0188 0.0358 0.0191 0.0264 0.01 ...
## $ DAYS_BIRTH : int -19241 -18064 -20038 -13976 -13040 -18604 -16685 -9516 -12744 -10395 ...
## $ DAYS_EMPLOYED : int -2329 -4469 -4458 -1866 -2191 -12009 -2580 -1387 -1013 -2625 ...
## $ DAYS_REGISTRATION : num -5170 -9118 -2175 -2000 -4000 ...
## $ DAYS_ID_PUBLISH : int -812 -1623 -3503 -4208 -4262 -2027 -241 -2055 -3171 -3041 ...
## $ OWN_CAR_AGE : num NA NA 5 NA 16 10 3 NA NA 5 ...
## $ FLAG_MOBIL : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_EMP_PHONE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_WORK_PHONE : int 0 0 0 0 1 0 0 1 0 1 ...
## $ FLAG_CONT_MOBILE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_PHONE : int 0 0 0 1 0 1 0 1 0 1 ...
## $ FLAG_EMAIL : int 1 0 0 0 0 0 0 0 0 0 ...
## $ OCCUPATION_TYPE : Factor w/ 19 levels "","Accountants",..: 1 11 6 16 1 6 7 5 5 16 ...
## $ CNT_FAM_MEMBERS : num 2 2 2 4 3 2 4 1 2 3 ...
## $ REGION_RATING_CLIENT : int 2 2 2 2 2 2 2 2 1 2 ...
## $ REGION_RATING_CLIENT_W_CITY : int 2 2 2 2 2 2 2 2 1 2 ...
## $ WEEKDAY_APPR_PROCESS_START : Factor w/ 7 levels "FRIDAY","MONDAY",..: 6 1 2 7 1 2 5 1 5 6 ...
## $ HOUR_APPR_PROCESS_START : int 18 9 14 11 5 15 9 7 18 14 ...
## $ REG_REGION_NOT_LIVE_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_REGION_NOT_WORK_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LIVE_REGION_NOT_WORK_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_CITY_NOT_LIVE_CITY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_CITY_NOT_WORK_CITY : int 0 0 0 0 1 0 1 0 0 0 ...
## $ LIVE_CITY_NOT_WORK_CITY : int 0 0 0 0 1 0 1 0 0 0 ...
## $ ORGANIZATION_TYPE : Factor w/ 58 levels "Advertising",..: 29 43 55 6 6 12 27 43 40 47 ...
## $ EXT_SOURCE_1 : num 0.753 0.565 NA 0.526 0.202 ...
## $ EXT_SOURCE_2 : num 0.79 0.292 0.7 0.51 0.426 ...
## $ EXT_SOURCE_3 : num 0.16 0.433 0.611 0.613 NA ...
## $ APARTMENTS_AVG : num 0.066 NA NA 0.305 NA ...
## $ BASEMENTAREA_AVG : num 0.059 NA NA 0.197 NA ...
## $ YEARS_BEGINEXPLUATATION_AVG : num 0.973 NA NA 0.997 NA ...
## $ YEARS_BUILD_AVG : num NA NA NA 0.959 NA ...
## $ COMMONAREA_AVG : num NA NA NA 0.117 NA ...
## $ ELEVATORS_AVG : num NA NA NA 0.32 NA 0.16 NA NA 0 NA ...
## $ ENTRANCES_AVG : num 0.138 NA NA 0.276 NA ...
## $ FLOORSMAX_AVG : num 0.125 NA NA 0.375 NA ...
## $ FLOORSMIN_AVG : num NA NA NA 0.0417 NA 0.375 NA NA NA NA ...
## $ LANDAREA_AVG : num NA NA NA 0.204 NA ...
## $ LIVINGAPARTMENTS_AVG : num NA NA NA 0.24 NA ...
## $ LIVINGAREA_AVG : num 0.0505 NA NA 0.3673 NA ...
## $ NONLIVINGAPARTMENTS_AVG : num NA NA NA 0.0386 NA 0.0116 NA NA NA NA ...
## $ NONLIVINGAREA_AVG : num NA NA NA 0.08 NA 0.0731 NA NA NA NA ...
## $ APARTMENTS_MODE : num 0.0672 NA NA 0.3109 NA ...
## $ BASEMENTAREA_MODE : num 0.0612 NA NA 0.2049 NA ...
## $ YEARS_BEGINEXPLUATATION_MODE: num 0.973 NA NA 0.997 NA ...
## $ YEARS_BUILD_MODE : num NA NA NA 0.961 NA ...
## $ COMMONAREA_MODE : num NA NA NA 0.118 NA ...
## $ ELEVATORS_MODE : num NA NA NA 0.322 NA ...
## $ ENTRANCES_MODE : num 0.138 NA NA 0.276 NA ...
## $ FLOORSMAX_MODE : num 0.125 NA NA 0.375 NA ...
## $ FLOORSMIN_MODE : num NA NA NA 0.0417 NA 0.375 NA NA NA NA ...
## $ LANDAREA_MODE : num NA NA NA 0.209 NA ...
## $ LIVINGAPARTMENTS_MODE : num NA NA NA 0.263 NA ...
## $ LIVINGAREA_MODE : num 0.0526 NA NA 0.3827 NA ...
## $ NONLIVINGAPARTMENTS_MODE : num NA NA NA 0.0389 NA 0.0117 NA NA NA NA ...
## $ NONLIVINGAREA_MODE : num NA NA NA 0.0847 NA 0.0774 NA NA NA NA ...
## $ APARTMENTS_MEDI : num 0.0666 NA NA 0.3081 NA ...
## $ BASEMENTAREA_MEDI : num 0.059 NA NA 0.197 NA ...
## $ YEARS_BEGINEXPLUATATION_MEDI: num 0.973 NA NA 0.997 NA ...
## $ YEARS_BUILD_MEDI : num NA NA NA 0.96 NA ...
## $ COMMONAREA_MEDI : num NA NA NA 0.117 NA ...
## $ ELEVATORS_MEDI : num NA NA NA 0.32 NA 0.16 NA NA 0 NA ...
## $ ENTRANCES_MEDI : num 0.138 NA NA 0.276 NA ...
## $ FLOORSMAX_MEDI : num 0.125 NA NA 0.375 NA ...
## $ FLOORSMIN_MEDI : num NA NA NA 0.0417 NA 0.375 NA NA NA NA ...
## $ LANDAREA_MEDI : num NA NA NA 0.208 NA ...
## $ LIVINGAPARTMENTS_MEDI : num NA NA NA 0.245 NA ...
## $ LIVINGAREA_MEDI : num 0.0514 NA NA 0.3739 NA ...
## $ NONLIVINGAPARTMENTS_MEDI : num NA NA NA 0.0388 NA 0.0116 NA NA NA NA ...
## $ NONLIVINGAREA_MEDI : num NA NA NA 0.0817 NA 0.0746 NA NA NA NA ...
## $ FONDKAPREMONT_MODE : Factor w/ 5 levels "","not specified",..: 1 1 1 4 1 2 1 1 1 1 ...
## $ HOUSETYPE_MODE : Factor w/ 4 levels "","block of flats",..: 2 1 1 2 1 2 1 1 2 1 ...
## $ TOTALAREA_MODE : num 0.0392 NA NA 0.37 NA ...
## $ WALLSMATERIAL_MODE : Factor w/ 8 levels "","Block","Mixed",..: 7 1 1 6 1 2 1 1 7 1 ...
## $ EMERGENCYSTATE_MODE : Factor w/ 3 levels "","No","Yes": 2 1 1 2 1 2 1 1 2 1 ...
## $ OBS_30_CNT_SOCIAL_CIRCLE : num 0 0 0 0 0 0 1 0 0 4 ...
## $ DEF_30_CNT_SOCIAL_CIRCLE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OBS_60_CNT_SOCIAL_CIRCLE : num 0 0 0 0 0 0 1 0 0 4 ...
## $ DEF_60_CNT_SOCIAL_CIRCLE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DAYS_LAST_PHONE_CHANGE : num -1740 0 -856 -1805 -821 ...
## $ FLAG_DOCUMENT_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FLAG_DOCUMENT_3 : int 1 1 0 1 1 0 1 0 1 1 ...
## $ FLAG_DOCUMENT_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FLAG_DOCUMENT_5 : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
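The two structures show that some factors have different level sets in train and test, for example `CODE_GENDER` (3 levels versus 2) and `NAME_INCOME_TYPE` (8 versus 7). Because each file is read separately, the integer codes behind the factors may not line up. The following is a minimal sketch, on made-up vectors, of one way to re-create a test factor with the training levels.

```r
# Made-up vectors that mimic CODE_GENDER in the train and test files.
train_gender <- factor(c("M", "F", "XNA", "F"))
test_gender <- factor(c("F", "M", "M"))

# Levels that appear in train but not in test.
print(setdiff(levels(train_gender), levels(test_gender)))

# Rebuild the test factor on the training levels so both share one coding.
test_gender_aligned <- factor(as.character(test_gender), levels = levels(train_gender))

print(levels(test_gender_aligned))
print(identical(levels(test_gender_aligned), levels(train_gender)))
```

A test value that never occurs in train would become `NA` under this approach, so it is worth checking for new `NA`s after aligning.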
# Summarize the test and train data
summary(app_train)
## SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR
## Min. :100002 0:282686 Cash loans :278232 F :202448 N:202924
## 1st Qu.:189146 1: 24825 Revolving loans: 29279 M :105059 Y:104587
## Median :278202 XNA: 4
## Mean :278180
## 3rd Qu.:367142
## Max. :456255
##
## FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT
## N: 94199 Min. : 0.0000 Min. : 25650 Min. : 45000
## Y:213312 1st Qu.: 0.0000 1st Qu.: 112500 1st Qu.: 270000
## Median : 0.0000 Median : 147150 Median : 513531
## Mean : 0.4171 Mean : 168798 Mean : 599026
## 3rd Qu.: 1.0000 3rd Qu.: 202500 3rd Qu.: 808650
## Max. :19.0000 Max. :117000000 Max. :4050000
##
## AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE
## Min. : 1616 Min. : 40500 Unaccompanied :248526
## 1st Qu.: 16524 1st Qu.: 238500 Family : 40149
## Median : 24903 Median : 450000 Spouse, partner: 11370
## Mean : 27109 Mean : 538396 Children : 3267
## 3rd Qu.: 34596 3rd Qu.: 679500 Other_B : 1770
## Max. :258026 Max. :4050000 : 1292
## NA's :12 NA's :278 (Other) : 1137
## NAME_INCOME_TYPE NAME_EDUCATION_TYPE
## Working :158774 Academic degree : 164
## Commercial associate: 71617 Higher education : 74863
## Pensioner : 55362 Incomplete higher : 10277
## State servant : 21703 Lower secondary : 3816
## Unemployed : 22 Secondary / secondary special:218391
## Student : 18
## (Other) : 15
## NAME_FAMILY_STATUS NAME_HOUSING_TYPE
## Civil marriage : 29775 Co-op apartment : 1122
## Married :196432 House / apartment :272868
## Separated : 19770 Municipal apartment: 11183
## Single / not married: 45444 Office apartment : 2617
## Unknown : 2 Rented apartment : 4881
## Widow : 16088 With parents : 14840
##
## REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION
## Min. :0.00029 Min. :-25229 Min. :-17912 Min. :-24672
## 1st Qu.:0.01001 1st Qu.:-19682 1st Qu.: -2760 1st Qu.: -7480
## Median :0.01885 Median :-15750 Median : -1213 Median : -4504
## Mean :0.02087 Mean :-16037 Mean : 63815 Mean : -4986
## 3rd Qu.:0.02866 3rd Qu.:-12413 3rd Qu.: -289 3rd Qu.: -2010
## Max. :0.07251 Max. : -7489 Max. :365243 Max. : 0
##
## DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE
## Min. :-7197 Min. : 0.00 Min. :0 Min. :0.0000
## 1st Qu.:-4299 1st Qu.: 5.00 1st Qu.:1 1st Qu.:1.0000
## Median :-3254 Median : 9.00 Median :1 Median :1.0000
## Mean :-2994 Mean :12.06 Mean :1 Mean :0.8199
## 3rd Qu.:-1720 3rd Qu.:15.00 3rd Qu.:1 3rd Qu.:1.0000
## Max. : 0 Max. :91.00 Max. :1 Max. :1.0000
## NA's :202929
## FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :1.0000 Median :0.0000 Median :0.00000
## Mean :0.1994 Mean :0.9981 Mean :0.2811 Mean :0.05672
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT
## :96391 Min. : 1.000 Min. :1.000
## Laborers :55186 1st Qu.: 2.000 1st Qu.:2.000
## Sales staff:32102 Median : 2.000 Median :2.000
## Core staff :27570 Mean : 2.153 Mean :2.052
## Managers :21371 3rd Qu.: 3.000 3rd Qu.:2.000
## Drivers :18603 Max. :20.000 Max. :3.000
## (Other) :56288 NA's :2
## REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START
## Min. :1.000 FRIDAY :50338 Min. : 0.00
## 1st Qu.:2.000 MONDAY :50714 1st Qu.:10.00
## Median :2.000 SATURDAY :33852 Median :12.00
## Mean :2.032 SUNDAY :16181 Mean :12.06
## 3rd Qu.:2.000 THURSDAY :50591 3rd Qu.:14.00
## Max. :3.000 TUESDAY :53901 Max. :23.00
## WEDNESDAY:51934
## REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.01514 Mean :0.05077
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
##
## LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.04066 Mean :0.07817 Mean :0.2305
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1
## Min. :0.0000 Business Entity Type 3: 67992 Min. :0.01
## 1st Qu.:0.0000 XNA : 55374 1st Qu.:0.33
## Median :0.0000 Self-employed : 38412 Median :0.51
## Mean :0.1796 Other : 16683 Mean :0.50
## 3rd Qu.:0.0000 Medicine : 11193 3rd Qu.:0.68
## Max. :1.0000 Business Entity Type 2: 10553 Max. :0.96
## (Other) :107304 NA's :173378
## EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG
## Min. :0.0000 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.3925 1st Qu.:0.37 1st Qu.:0.06 1st Qu.:0.04
## Median :0.5660 Median :0.54 Median :0.09 Median :0.08
## Mean :0.5144 Mean :0.51 Mean :0.12 Mean :0.09
## 3rd Qu.:0.6636 3rd Qu.:0.67 3rd Qu.:0.15 3rd Qu.:0.11
## Max. :0.8550 Max. :0.90 Max. :1.00 Max. :1.00
## NA's :660 NA's :60965 NA's :156061 NA's :179943
## YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.98 1st Qu.:0.69 1st Qu.:0.01 1st Qu.:0.00
## Median :0.98 Median :0.76 Median :0.02 Median :0.00
## Mean :0.98 Mean :0.75 Mean :0.04 Mean :0.08
## 3rd Qu.:0.99 3rd Qu.:0.82 3rd Qu.:0.05 3rd Qu.:0.12
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :150007 NA's :204488 NA's :214865 NA's :163891
## ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.07 1st Qu.:0.17 1st Qu.:0.08 1st Qu.:0.02
## Median :0.14 Median :0.17 Median :0.21 Median :0.05
## Mean :0.15 Mean :0.23 Mean :0.23 Mean :0.07
## 3rd Qu.:0.21 3rd Qu.:0.33 3rd Qu.:0.38 3rd Qu.:0.09
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :154828 NA's :153020 NA's :208642 NA's :182590
## LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.05 1st Qu.:0.05 1st Qu.:0.00
## Median :0.08 Median :0.07 Median :0.00
## Mean :0.10 Mean :0.11 Mean :0.01
## 3rd Qu.:0.12 3rd Qu.:0.13 3rd Qu.:0.00
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :210199 NA's :154350 NA's :213514
## NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.05 1st Qu.:0.04
## Median :0.00 Median :0.08 Median :0.07
## Mean :0.03 Mean :0.11 Mean :0.09
## 3rd Qu.:0.03 3rd Qu.:0.14 3rd Qu.:0.11
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :169682 NA's :156061 NA's :179943
## YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.98 1st Qu.:0.70 1st Qu.:0.01
## Median :0.98 Median :0.76 Median :0.02
## Mean :0.98 Mean :0.76 Mean :0.04
## 3rd Qu.:0.99 3rd Qu.:0.82 3rd Qu.:0.05
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :150007 NA's :204488 NA's :214865
## ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.07 1st Qu.:0.17 1st Qu.:0.08
## Median :0.00 Median :0.14 Median :0.17 Median :0.21
## Mean :0.07 Mean :0.15 Mean :0.22 Mean :0.23
## 3rd Qu.:0.12 3rd Qu.:0.21 3rd Qu.:0.33 3rd Qu.:0.38
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :163891 NA's :154828 NA's :153020 NA's :208642
## LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.02 1st Qu.:0.05 1st Qu.:0.04
## Median :0.05 Median :0.08 Median :0.07
## Mean :0.06 Mean :0.11 Mean :0.11
## 3rd Qu.:0.08 3rd Qu.:0.13 3rd Qu.:0.13
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :182590 NA's :210199 NA's :154350
## NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.06 1st Qu.:0.04
## Median :0.00 Median :0.00 Median :0.09 Median :0.08
## Mean :0.01 Mean :0.03 Mean :0.12 Mean :0.09
## 3rd Qu.:0.00 3rd Qu.:0.02 3rd Qu.:0.15 3rd Qu.:0.11
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :213514 NA's :169682 NA's :156061 NA's :179943
## YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.98 1st Qu.:0.69 1st Qu.:0.01
## Median :0.98 Median :0.76 Median :0.02
## Mean :0.98 Mean :0.76 Mean :0.04
## 3rd Qu.:0.99 3rd Qu.:0.83 3rd Qu.:0.05
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :150007 NA's :204488 NA's :214865
## ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.07 1st Qu.:0.17 1st Qu.:0.08
## Median :0.00 Median :0.14 Median :0.17 Median :0.21
## Mean :0.08 Mean :0.15 Mean :0.23 Mean :0.23
## 3rd Qu.:0.12 3rd Qu.:0.21 3rd Qu.:0.33 3rd Qu.:0.38
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
## NA's :163891 NA's :154828 NA's :153020 NA's :208642
## LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.02 1st Qu.:0.05 1st Qu.:0.05
## Median :0.05 Median :0.08 Median :0.07
## Mean :0.07 Mean :0.10 Mean :0.11
## 3rd Qu.:0.09 3rd Qu.:0.12 3rd Qu.:0.13
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :182590 NA's :210199 NA's :154350
## NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE
## Min. :0.00 Min. :0.00 :210295
## 1st Qu.:0.00 1st Qu.:0.00 not specified : 5687
## Median :0.00 Median :0.00 org spec account : 5619
## Mean :0.01 Mean :0.03 reg oper account : 73830
## 3rd Qu.:0.00 3rd Qu.:0.03 reg oper spec account: 12080
## Max. :1.00 Max. :1.00
## NA's :213514 NA's :169682
## HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE
## :154297 Min. :0.00 :156341
## block of flats :150503 1st Qu.:0.04 Panel : 66040
## specific housing: 1499 Median :0.07 Stone, brick: 64815
## terraced house : 1212 Mean :0.10 Block : 9253
## 3rd Qu.:0.13 Wooden : 5362
## Max. :1.00 Mixed : 2296
## NA's :148431 (Other) : 3404
## EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE
## :145755 Min. : 0.000 Min. : 0.0000
## No :159428 1st Qu.: 0.000 1st Qu.: 0.0000
## Yes: 2328 Median : 0.000 Median : 0.0000
## Mean : 1.422 Mean : 0.1434
## 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :348.000 Max. :34.0000
## NA's :1021 NA's :1021
## OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE
## Min. : 0.000 Min. : 0.0 Min. :-4292.0
## 1st Qu.: 0.000 1st Qu.: 0.0 1st Qu.:-1570.0
## Median : 0.000 Median : 0.0 Median : -757.0
## Mean : 1.405 Mean : 0.1 Mean : -962.9
## 3rd Qu.: 2.000 3rd Qu.: 0.0 3rd Qu.: -274.0
## Max. :344.000 Max. :24.0 Max. : 0.0
## NA's :1021 NA's :1021 NA's :1
## FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5
## Min. :0.00e+00 Min. :0.00 Min. :0.00e+00 Min. :0.00000
## 1st Qu.:0.00e+00 1st Qu.:0.00 1st Qu.:0.00e+00 1st Qu.:0.00000
## Median :0.00e+00 Median :1.00 Median :0.00e+00 Median :0.00000
## Mean :4.23e-05 Mean :0.71 Mean :8.13e-05 Mean :0.01511
## 3rd Qu.:0.00e+00 3rd Qu.:1.00 3rd Qu.:0.00e+00 3rd Qu.:0.00000
## Max. :1.00e+00 Max. :1.00 Max. :1.00e+00 Max. :1.00000
##
## FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9
## Min. :0.00000 Min. :0.0000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.0000000 Median :0.00000 Median :0.000000
## Mean :0.08806 Mean :0.0001919 Mean :0.08138 Mean :0.003896
## 3rd Qu.:0.00000 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.0000000 Max. :1.00000 Max. :1.000000
##
## FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13
## Min. :0.00e+00 Min. :0.000000 Min. :0.0e+00 Min. :0.000000
## 1st Qu.:0.00e+00 1st Qu.:0.000000 1st Qu.:0.0e+00 1st Qu.:0.000000
## Median :0.00e+00 Median :0.000000 Median :0.0e+00 Median :0.000000
## Mean :2.28e-05 Mean :0.003912 Mean :6.5e-06 Mean :0.003525
## 3rd Qu.:0.00e+00 3rd Qu.:0.000000 3rd Qu.:0.0e+00 3rd Qu.:0.000000
## Max. :1.00e+00 Max. :1.000000 Max. :1.0e+00 Max. :1.000000
##
## FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.00000 Median :0.000000 Median :0.0000000
## Mean :0.002936 Mean :0.00121 Mean :0.009928 Mean :0.0002667
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.0000000
##
## FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21
## Min. :0.00000 Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
## 1st Qu.:0.00000 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000
## Median :0.00000 Median :0.0000000 Median :0.0000000 Median :0.0000000
## Mean :0.00813 Mean :0.0005951 Mean :0.0005073 Mean :0.0003349
## 3rd Qu.:0.00000 3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
## Max. :1.00000 Max. :1.0000000 Max. :1.0000000 Max. :1.0000000
##
## AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY
## Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :0.00
## Mean :0.01 Mean :0.01
## 3rd Qu.:0.00 3rd Qu.:0.00
## Max. :4.00 Max. :9.00
## NA's :41519 NA's :41519
## AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
## Min. :0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median :0.00 Median : 0.00 Median : 0.00
## Mean :0.03 Mean : 0.27 Mean : 0.27
## 3rd Qu.:0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :8.00 Max. :27.00 Max. :261.00
## NA's :41519 NA's :41519 NA's :41519
## AMT_REQ_CREDIT_BUREAU_YEAR
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 1.0
## Mean : 1.9
## 3rd Qu.: 3.0
## Max. :25.0
## NA's :41519
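The `TARGET` counts in the summary above (282686 zeros and 24825 ones) show a strongly imbalanced target. As a small worked example, the block below turns those two counts into class shares and the accuracy of a trivial model that always predicts 0. The variable names are illustrative only.

```r
# TARGET counts copied from summary(app_train).
target_counts <- c("0" = 282686, "1" = 24825)

# Share of each class in the training data.
class_share <- target_counts / sum(target_counts)
print(round(class_share, 4))

# Accuracy of a model that always predicts the majority class (0).
baseline_accuracy <- max(class_share)
print(round(baseline_accuracy, 4))
```

Since always predicting 0 already scores above 90% accuracy, accuracy alone says little here; metrics such as AUC, or precision and recall on class 1, are likely to be more informative.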
summary(app_test)
## SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR
## Min. :100001 Cash loans :48305 F:32678 N:32311
## 1st Qu.:188558 Revolving loans: 439 M:16066 Y:16433
## Median :277549
## Mean :277797
## 3rd Qu.:367556
## Max. :456250
##
## FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT
## N:15086 Min. : 0.0000 Min. : 26942 Min. : 45000
## Y:33658 1st Qu.: 0.0000 1st Qu.: 112500 1st Qu.: 260640
## Median : 0.0000 Median : 157500 Median : 450000
## Mean : 0.3971 Mean : 178432 Mean : 516740
## 3rd Qu.: 1.0000 3rd Qu.: 225000 3rd Qu.: 675000
## Max. :20.0000 Max. :4410000 Max. :2245500
##
## AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE
## Min. : 2295 Min. : 45000 Unaccompanied :39727
## 1st Qu.: 17973 1st Qu.: 225000 Family : 5881
## Median : 26199 Median : 396000 Spouse, partner: 1448
## Mean : 29426 Mean : 462619 : 911
## 3rd Qu.: 37390 3rd Qu.: 630000 Children : 408
## Max. :180576 Max. :2245500 Other_B : 211
## NA's :24 (Other) : 158
## NAME_INCOME_TYPE NAME_EDUCATION_TYPE
## Businessman : 1 Academic degree : 41
## Commercial associate:11402 Higher education :12516
## Pensioner : 9273 Incomplete higher : 1724
## State servant : 3532 Lower secondary : 475
## Student : 2 Secondary / secondary special:33988
## Unemployed : 1
## Working :24533
## NAME_FAMILY_STATUS NAME_HOUSING_TYPE
## Civil marriage : 4261 Co-op apartment : 123
## Married :32283 House / apartment :43645
## Separated : 2955 Municipal apartment: 1617
## Single / not married: 7036 Office apartment : 407
## Widow : 2209 Rented apartment : 718
## With parents : 2234
##
## REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION
## Min. :0.000253 Min. :-25195 Min. :-17463 Min. :-23722
## 1st Qu.:0.010006 1st Qu.:-19637 1st Qu.: -2910 1st Qu.: -7459
## Median :0.018850 Median :-15785 Median : -1293 Median : -4490
## Mean :0.021226 Mean :-16068 Mean : 67485 Mean : -4968
## 3rd Qu.:0.028663 3rd Qu.:-12496 3rd Qu.: -296 3rd Qu.: -1901
## Max. :0.072508 Max. : -7338 Max. :365243 Max. : 0
##
## DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE
## Min. :-6348 Min. : 0.00 Min. :0 Min. :0.0000 Min. :0.0000
## 1st Qu.:-4448 1st Qu.: 4.00 1st Qu.:1 1st Qu.:1.0000 1st Qu.:0.0000
## Median :-3234 Median : 9.00 Median :1 Median :1.0000 Median :0.0000
## Mean :-3052 Mean :11.79 Mean :1 Mean :0.8097 Mean :0.2047
## 3rd Qu.:-1706 3rd Qu.:15.00 3rd Qu.:1 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. : 0 Max. :74.00 Max. :1 Max. :1.0000 Max. :1.0000
## NA's :32312
## FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE
## Min. :0.0000 Min. :0.0000 Min. :0.0000 :15605
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 Laborers : 8655
## Median :1.0000 Median :0.0000 Median :0.0000 Sales staff: 5072
## Mean :0.9984 Mean :0.2631 Mean :0.1626 Core staff : 4361
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 Managers : 3574
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Drivers : 2773
## (Other) : 8704
## CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY
## Min. : 1.000 Min. :1.000 Min. :-1.000
## 1st Qu.: 2.000 1st Qu.:2.000 1st Qu.: 2.000
## Median : 2.000 Median :2.000 Median : 2.000
## Mean : 2.147 Mean :2.038 Mean : 2.013
## 3rd Qu.: 3.000 3rd Qu.:2.000 3rd Qu.: 2.000
## Max. :21.000 Max. :3.000 Max. : 3.000
##
## WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION
## FRIDAY :7250 Min. : 0.00 Min. :0.00000
## MONDAY :8406 1st Qu.:10.00 1st Qu.:0.00000
## SATURDAY :4603 Median :12.00 Median :0.00000
## SUNDAY :1859 Mean :12.01 Mean :0.01883
## THURSDAY :8418 3rd Qu.:14.00 3rd Qu.:0.00000
## TUESDAY :9751 Max. :23.00 Max. :1.00000
## WEDNESDAY:8457
## REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.05517 Mean :0.04204 Mean :0.07747
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE
## Min. :0.0000 Min. :0.0000 Business Entity Type 3:10840
## 1st Qu.:0.0000 1st Qu.:0.0000 XNA : 9274
## Median :0.0000 Median :0.0000 Self-employed : 5920
## Mean :0.2247 Mean :0.1742 Other : 2707
## 3rd Qu.:0.0000 3rd Qu.:0.0000 Medicine : 1716
## Max. :1.0000 Max. :1.0000 Government : 1508
## (Other) :16779
## EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG
## Min. :0.013 Min. :0.000008 Min. :0.001 Min. :0.000
## 1st Qu.:0.344 1st Qu.:0.408066 1st Qu.:0.364 1st Qu.:0.062
## Median :0.507 Median :0.558758 Median :0.519 Median :0.093
## Mean :0.501 Mean :0.518021 Mean :0.500 Mean :0.122
## 3rd Qu.:0.666 3rd Qu.:0.658497 3rd Qu.:0.653 3rd Qu.:0.148
## Max. :0.939 Max. :0.855000 Max. :0.883 Max. :1.000
## NA's :20532 NA's :8 NA's :8668 NA's :23887
## BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :0.00
## 1st Qu.:0.047 1st Qu.:0.977 1st Qu.:0.69 1st Qu.:0.01
## Median :0.078 Median :0.982 Median :0.76 Median :0.02
## Mean :0.090 Mean :0.979 Mean :0.75 Mean :0.05
## 3rd Qu.:0.113 3rd Qu.:0.987 3rd Qu.:0.82 3rd Qu.:0.05
## Max. :1.000 Max. :1.000 Max. :1.00 Max. :1.00
## NA's :27641 NA's :22856 NA's :31818 NA's :33495
## ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:0.074 1st Qu.:0.167 1st Qu.:0.10
## Median :0.000 Median :0.138 Median :0.167 Median :0.21
## Mean :0.085 Mean :0.152 Mean :0.234 Mean :0.24
## 3rd Qu.:0.160 3rd Qu.:0.207 3rd Qu.:0.333 3rd Qu.:0.38
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.00
## NA's :25189 NA's :23579 NA's :23321 NA's :32466
## LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.00
## 1st Qu.:0.019 1st Qu.:0.05 1st Qu.:0.049 1st Qu.:0.00
## Median :0.048 Median :0.08 Median :0.077 Median :0.00
## Mean :0.067 Mean :0.11 Mean :0.112 Mean :0.01
## 3rd Qu.:0.087 3rd Qu.:0.13 3rd Qu.:0.138 3rd Qu.:0.01
## Max. :1.000 Max. :1.00 Max. :1.000 Max. :1.00
## NA's :28254 NA's :32780 NA's :23552 NA's :33347
## NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.059 1st Qu.:0.043
## Median :0.004 Median :0.085 Median :0.077
## Mean :0.029 Mean :0.119 Mean :0.089
## 3rd Qu.:0.029 3rd Qu.:0.150 3rd Qu.:0.114
## Max. :1.000 Max. :1.000 Max. :1.000
## NA's :26084 NA's :23887 NA's :27641
## YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE
## Min. :0.000 Min. :0.00 Min. :0.00 Min. :0.000
## 1st Qu.:0.976 1st Qu.:0.69 1st Qu.:0.01 1st Qu.:0.000
## Median :0.982 Median :0.76 Median :0.02 Median :0.000
## Mean :0.978 Mean :0.76 Mean :0.05 Mean :0.081
## 3rd Qu.:0.987 3rd Qu.:0.82 3rd Qu.:0.05 3rd Qu.:0.121
## Max. :1.000 Max. :1.00 Max. :1.00 Max. :1.000
## NA's :22856 NA's :31818 NA's :33495 NA's :25189
## ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.069 1st Qu.:0.167 1st Qu.:0.08 1st Qu.:0.017
## Median :0.138 Median :0.167 Median :0.21 Median :0.046
## Mean :0.147 Mean :0.229 Mean :0.23 Mean :0.066
## 3rd Qu.:0.207 3rd Qu.:0.333 3rd Qu.:0.38 3rd Qu.:0.086
## Max. :1.000 Max. :1.000 Max. :1.00 Max. :1.000
## NA's :23579 NA's :23321 NA's :32466 NA's :28254
## LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE
## Min. :0.00 Min. :0.000 Min. :0.00
## 1st Qu.:0.06 1st Qu.:0.046 1st Qu.:0.00
## Median :0.08 Median :0.075 Median :0.00
## Mean :0.11 Mean :0.111 Mean :0.01
## 3rd Qu.:0.13 3rd Qu.:0.131 3rd Qu.:0.00
## Max. :1.00 Max. :1.000 Max. :1.00
## NA's :32780 NA's :23552 NA's :33347
## NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.062 1st Qu.:0.046
## Median :0.001 Median :0.093 Median :0.078
## Mean :0.028 Mean :0.123 Mean :0.090
## 3rd Qu.:0.024 3rd Qu.:0.150 3rd Qu.:0.113
## Max. :1.000 Max. :1.000 Max. :1.000
## NA's :26084 NA's :23887 NA's :27641
## YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI
## Min. :0.000 Min. :0.00 Min. :0.00 Min. :0.000
## 1st Qu.:0.977 1st Qu.:0.69 1st Qu.:0.01 1st Qu.:0.000
## Median :0.982 Median :0.76 Median :0.02 Median :0.000
## Mean :0.979 Mean :0.75 Mean :0.05 Mean :0.084
## 3rd Qu.:0.987 3rd Qu.:0.82 3rd Qu.:0.05 3rd Qu.:0.160
## Max. :1.000 Max. :1.00 Max. :1.00 Max. :1.000
## NA's :22856 NA's :31818 NA's :33495 NA's :25189
## ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.069 1st Qu.:0.167 1st Qu.:0.08 1st Qu.:0.019
## Median :0.138 Median :0.167 Median :0.21 Median :0.049
## Mean :0.151 Mean :0.233 Mean :0.24 Mean :0.068
## 3rd Qu.:0.207 3rd Qu.:0.333 3rd Qu.:0.38 3rd Qu.:0.088
## Max. :1.000 Max. :1.000 Max. :1.00 Max. :1.000
## NA's :23579 NA's :23321 NA's :32466 NA's :28254
## LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI
## Min. :0.00 Min. :0.000 Min. :0.00
## 1st Qu.:0.05 1st Qu.:0.049 1st Qu.:0.00
## Median :0.08 Median :0.078 Median :0.00
## Mean :0.11 Mean :0.113 Mean :0.01
## 3rd Qu.:0.13 3rd Qu.:0.137 3rd Qu.:0.00
## Max. :1.00 Max. :1.000 Max. :1.00
## NA's :32780 NA's :23552 NA's :33347
## NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE
## Min. :0.000 :32797 :23619
## 1st Qu.:0.000 not specified : 913 block of flats :24659
## Median :0.003 org spec account : 920 specific housing: 262
## Mean :0.029 reg oper account :12124 terraced house : 204
## 3rd Qu.:0.028 reg oper spec account: 1990
## Max. :1.000
## NA's :26084
## TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
## Min. :0.000 :23893 :22209
## 1st Qu.:0.043 Panel :11269 No :26179
## Median :0.071 Stone, brick:10434 Yes: 356
## Mean :0.107 Block : 1428
## 3rd Qu.:0.136 Wooden : 794
## Max. :1.000 Mixed : 353
## NA's :22624 (Other) : 573
## OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
## Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 0.000 Median : 0.0000 Median : 0.000
## Mean : 1.448 Mean : 0.1436 Mean : 1.436
## 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
## Max. :354.000 Max. :34.0000 Max. :351.000
## NA's :29 NA's :29 NA's :29
## DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2
## Min. : 0.0000 Min. :-4361 Min. :0
## 1st Qu.: 0.0000 1st Qu.:-1766 1st Qu.:0
## Median : 0.0000 Median : -863 Median :0
## Mean : 0.1011 Mean :-1078 Mean :0
## 3rd Qu.: 0.0000 3rd Qu.: -363 3rd Qu.:0
## Max. :24.0000 Max. : 0 Max. :0
## NA's :29
## FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6
## Min. :0.0000 Min. :0.0000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:1.0000 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :1.0000 Median :0.0000000 Median :0.00000 Median :0.00000
## Mean :0.7866 Mean :0.0001026 Mean :0.01475 Mean :0.08748
## 3rd Qu.:1.0000 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000000 Max. :1.00000 Max. :1.00000
##
## FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10
## Min. :0.0e+00 Min. :0.00000 Min. :0.000000 Min. :0
## 1st Qu.:0.0e+00 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0
## Median :0.0e+00 Median :0.00000 Median :0.000000 Median :0
## Mean :4.1e-05 Mean :0.08846 Mean :0.004493 Mean :0
## 3rd Qu.:0.0e+00 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0
## Max. :1.0e+00 Max. :1.00000 Max. :1.000000 Max. :0
##
## FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14
## Min. :0.000000 Min. :0 Min. :0 Min. :0
## 1st Qu.:0.000000 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0.000000 Median :0 Median :0 Median :0
## Mean :0.001169 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0.000000 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :1.000000 Max. :0 Max. :0 Max. :0
##
## FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18
## Min. :0 Min. :0 Min. :0 Min. :0.000000
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0.000000
## Median :0 Median :0 Median :0 Median :0.000000
## Mean :0 Mean :0 Mean :0 Mean :0.001559
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.000000
## Max. :0 Max. :0 Max. :0 Max. :1.000000
##
## FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR
## Min. :0 Min. :0 Min. :0 Min. :0.000
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0.000
## Median :0 Median :0 Median :0 Median :0.000
## Mean :0 Mean :0 Mean :0 Mean :0.002
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.000
## Max. :0 Max. :0 Max. :0 Max. :2.000
## NA's :6049
## AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.002 Mean :0.003 Mean :0.009
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :2.000 Max. :2.000 Max. :6.000
## NA's :6049 NA's :6049 NA's :6049
## AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.000 1st Qu.: 0.000
## Median :0.000 Median : 2.000
## Mean :0.547 Mean : 1.984
## 3rd Qu.:1.000 3rd Qu.: 3.000
## Max. :7.000 Max. :17.000
## NA's :6049 NA's :6049
Most of the data appears to be in good condition. A few points are worth noting. The number of females in the data set is about twice the number of males. About twice as many people do not own a car as do, while about twice as many own a home as do not. The maximum number of children is 19, which is possible but very high. DAYS_EMPLOYED appears to contain some incorrect data: its maximum is 365,243 days, roughly 1,000 years, which no one could have worked. Values like this also distort the mean, which in the summary above is positive even though the median and quartiles are negative.
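One possible way to handle the DAYS_EMPLOYED value is sketched below. It assumes that 365243 is a placeholder code rather than a real duration, which the data shown here does not confirm, and it uses a small invented data frame (`toy`) in place of `app_train`. The flag column name `DAYS_EMPLOYED_ANOM` is hypothetical.

```r
# Toy stand-in for app_train; the values are invented for illustration.
toy <- data.frame(DAYS_EMPLOYED = c(-2910, -1293, 365243, -296, 365243))

# Assumption: 365243 is a placeholder code, not a real number of days.
sentinel <- 365243

# Keep a flag so a model can still use the fact that the value was a placeholder.
toy$DAYS_EMPLOYED_ANOM <- toy$DAYS_EMPLOYED == sentinel

# Replace the placeholder with NA so it no longer distorts the mean.
toy$DAYS_EMPLOYED[toy$DAYS_EMPLOYED_ANOM] <- NA

mean(toy$DAYS_EMPLOYED, na.rm = TRUE)
```

Keeping the flag rather than simply deleting the rows preserves the remaining columns of those applicants for modeling.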
#Remove the ID column
app_train <- app_train[ -c(1) ]
Because an ID carries no information about someone's credit default risk, we remove it.
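Dropping a column by position depends on the column order staying the same. A minimal sketch of dropping it by name instead is shown here, assuming the ID column is called SK_ID_CURR as in the summary of app_test. The data frame `toy` and its values are invented.

```r
# Toy stand-in for app_train; the values are invented for illustration.
toy <- data.frame(
  SK_ID_CURR   = c(100002, 100003),
  TARGET       = c(1, 0),
  CNT_CHILDREN = c(0, 2)
)

# Drop the ID by name so the step still works if the column order changes.
toy <- toy[, setdiff(names(toy), "SK_ID_CURR"), drop = FALSE]

names(toy)
```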
# Normalize the income observation
app_train$Sqrt_Income <- app_train$AMT_INCOME_TOTAL %>% sqrt()
# Expand the Region Population Relative
app_train$RPR_Squared <- (app_train$REGION_POPULATION_RELATIVE)^2
#Add a new column to track daily income
app_train$Daily_Income <- (app_train$AMT_INCOME_TOTAL)/365
Adding the columns above helps put incomes on a more comparable scale, accentuates differences in region population, and tracks each individual's daily income.
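To see what these transformations do, here is a small sketch on invented income values. It illustrates two general facts: a square root compresses large values more than small ones, and dividing by a constant only rescales a variable, so the daily figure is perfectly correlated with the annual one.

```r
# Invented incomes with one very large value, mimicking a right-skewed variable.
income <- c(27000, 112500, 157500, 225000, 4410000)

sqrt_income  <- sqrt(income)   # compresses large values more than small ones
daily_income <- income / 365   # a pure rescaling of the same variable

# Ratio of the largest value to the median, before and after the square root.
ratio_raw  <- max(income) / median(income)
ratio_sqrt <- max(sqrt_income) / median(sqrt_income)
c(raw = ratio_raw, sqrt = ratio_sqrt)

# Daily income carries the same information as total income.
cor(income, daily_income)
```

For a linear model this means the daily income column adds no information beyond AMT_INCOME_TOTAL itself, although it can still be convenient for reading the data.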
#Build a correlation matrix comparing several variables. Uncomment for an output.
# app_train %>% select(TARGET, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_BIRTH, AMT_GOODS_PRICE, AMT_INCOME_TOTAL, AMT_CREDIT, NAME_INCOME_TYPE, OCCUPATION_TYPE, HOUR_APPR_PROCESS_START, WEEKDAY_APPR_PROCESS_START) %>% pairs.panels()
app_train %>% select(TARGET, Daily_Income, RPR_Squared, Sqrt_Income) %>% pairs.panels()
Based on the correlation matrix, there are no strong correlations
between the target variable and the other variables. Finding strong
predictors of the target variable will require additional work.
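As a complement to the panel plot, correlations with the target can be computed and ranked directly. The sketch below uses randomly generated data (`toy`, with hypothetical columns `A` and `B`) and assumes TARGET is stored as a numeric 0/1 column. If it is a factor in `app_train`, it would need converting first.

```r
set.seed(1)

# Randomly generated stand-in for app_train; A and B are hypothetical predictors.
toy <- data.frame(
  TARGET = rbinom(200, 1, 0.1),
  A      = rnorm(200),
  B      = rnorm(200)
)
toy$A[1:5] <- NA  # mimic missing values

# Numeric predictors, excluding the target itself.
num_cols <- setdiff(names(toy)[sapply(toy, is.numeric)], "TARGET")

# Correlation of each predictor with TARGET, ignoring rows with missing values.
cors <- sapply(toy[num_cols], function(x) {
  cor(x, toy$TARGET, use = "pairwise.complete.obs")
})

# Rank predictors by absolute correlation.
sort(abs(cors), decreasing = TRUE)
```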
#Plots ## Scatterplots
#Build a scatter plot of AMT_CREDIT and AMT_INCOME_TOTAL, colored by TARGET
ggplot(data = app_train, mapping = aes( x = AMT_CREDIT, y = AMT_INCOME_TOTAL, colour=TARGET)) +
geom_point() +
labs(title = "AMT_CREDIT AND AMT_INCOME_TOTAL broken down by TARGET")
This scatter plot shows that AMT_CREDIT and AMT_INCOME_TOTAL together do
not clearly separate the two target classes. Most incomes fall within the
same narrow range.
#Build a bar plot showing the Target variables broken down by occupation type
ggplot(data = app_train, mapping = aes(x = OCCUPATION_TYPE)) +
geom_bar() +
facet_wrap(facets = ~TARGET, ncol = 1) +
labs(title = "Bar plot of Occupation Type by Target")
This chart provides some interesting insights. Laborers, drivers, and sales staff have the highest number of applicants with difficulty paying back loans. However, those three occupations also have some of the highest numbers of applicants who do pay back their loans, which means the counts largely reflect how many individuals in this data set fall into those categories.
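Because raw counts mostly track category size, a default rate per category can be more informative. Here is a minimal sketch on invented data, assuming TARGET is numeric 0/1. The occupation names are reused from the data set but the numbers are made up.

```r
# Invented data: a large category with few defaults, a small one with a higher rate.
toy <- data.frame(
  OCCUPATION_TYPE = c(rep("Laborers", 10), rep("Managers", 4)),
  TARGET          = c(1, 1, rep(0, 8), 1, rep(0, 3))
)

# Number of defaults per occupation (what the bar plot shows).
counts <- tapply(toy$TARGET, toy$OCCUPATION_TYPE, sum)

# Share of defaults per occupation (independent of category size).
rates <- tapply(toy$TARGET, toy$OCCUPATION_TYPE, mean)

counts
rates
```

In this toy example the larger category has more defaults in absolute terms but the lower default rate, which is the distinction the counts alone cannot show.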
#Build a bar plot showing the Target variable broken down by income type
ggplot(data = app_train, mapping = aes(x = NAME_INCOME_TYPE)) +
geom_bar() +
facet_wrap(facets = ~TARGET, ncol = 1) +
labs(title = "Bar plot of Income Type by Target")
This chart provides insights similar to the one above. Working and Commercial associate have the highest number of applicants with difficulty paying back loans. Similarly, Working and Commercial associate have the highest number of applicants who pay back loans. We see this pattern throughout the data.
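A proportional bar chart is one way to compare categories of very different sizes. The sketch below uses invented data and assumes TARGET can be converted with `factor()`. `position = "fill"` scales each bar to a height of 1, so the split between target values becomes comparable across income types.

```r
library(ggplot2)

# Invented data; category names are reused from the data set, counts are made up.
toy <- data.frame(
  NAME_INCOME_TYPE = c(rep("Working", 20), rep("Pensioner", 8)),
  TARGET           = c(rep(1, 3), rep(0, 17), rep(1, 1), rep(0, 7))
)

# Each bar is scaled to 1, so the fill shows the share of each target value.
p <- ggplot(data = toy, mapping = aes(x = NAME_INCOME_TYPE, fill = factor(TARGET))) +
  geom_bar(position = "fill") +
  labs(title = "Share of Target by Income Type", y = "Proportion", fill = "TARGET")

p
```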
#Build a bar plot showing the Target variable broken down by education type
ggplot(data = app_train, mapping = aes(x = NAME_EDUCATION_TYPE)) +
geom_bar() +
facet_wrap(facets = ~TARGET, ncol = 1) +
labs(title = "Bar plot of Education Type by Target")
This chart provides some interesting insights. Applicants with secondary / secondary special education have the highest number of difficulties paying back loans. Similarly, they also have the highest number of applicants who pay back loans. We see this pattern throughout the data.
#Build a boxplot of the Target variable compared to the AMT_INCOME_TOTAL
ggplot(data = app_train, mapping = aes(x = TARGET, y = AMT_INCOME_TOTAL)) +
geom_boxplot() +
labs(title = "Boxplot of Income by Target")
This shows us there is a large outlier in the group where TARGET is 1. We will
want to remove that outlier and rerun the plot to achieve better results.
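One simple way to remove such an outlier before replotting is to cap the data at a high percentile. The sketch below uses invented incomes, and the 99th percentile cutoff is an arbitrary assumption, not a value taken from this analysis.

```r
# Invented incomes: 99 ordinary values plus one extreme value.
toy <- data.frame(
  AMT_INCOME_TOTAL = c(seq(50000, 300000, length.out = 99), 5e7)
)

# Assumption: treat anything above the 99th percentile as an outlier.
cutoff <- quantile(toy$AMT_INCOME_TOTAL, 0.99)

# Keep only rows at or below the cutoff.
toy_trimmed <- toy[toy$AMT_INCOME_TOTAL <= cutoff, , drop = FALSE]

nrow(toy_trimmed)
max(toy_trimmed$AMT_INCOME_TOTAL)
```

The trimmed data frame can then be passed to the same boxplot code. Plotting income on a log scale is an alternative that keeps every row.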
This data set is extremely large and complex. It has been difficult to find patterns in the characteristics that separate those who have payment difficulties from those who do not. The data seems relatively clean, though there are some outliers and questionable observations. More complex models and feature engineering will be needed to gather deeper insights.
#train_tree <- C5.0(formula = TARGET ~ .,data = app_train)
#train_tree$size
#plot(train_tree)
#train_tree <- rpart(TARGET~., data=app_train, method = 'class')
#rpart.plot(train_tree)
The tree model code is commented out but provided for future use in building models.
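For reference, here is a minimal runnable sketch of the rpart variant on randomly generated data. The toy data frame, the rule that generates TARGET, and the choice of predictors are all invented. The sketch also loads rpart explicitly, on the assumption that it is not already attached.

```r
library(rpart)

set.seed(42)
n <- 300

# Randomly generated stand-in for app_train.
toy <- data.frame(
  DAYS_BIRTH = -sample(8000:25000, n, replace = TRUE),
  AMT_CREDIT = round(runif(n, 45000, 2000000))
)

# Invented rule: younger applicants default more often in this toy data.
p_default  <- ifelse(toy$DAYS_BIRTH > -12000, 0.6, 0.1)
toy$TARGET <- factor(rbinom(n, 1, p_default))

# Classification tree, mirroring the commented-out rpart call.
train_tree <- rpart(TARGET ~ ., data = toy, method = "class")

# In-sample predictions and accuracy (optimistic, since no data was held out).
pred <- predict(train_tree, toy, type = "class")
mean(pred == toy$TARGET)
```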
# Simple Logistic Regression Model
default_model <- app_train %>% glm(formula = TARGET ~ DAYS_REGISTRATION + DAYS_BIRTH + WEEKDAY_APPR_PROCESS_START + HOUR_APPR_PROCESS_START, family = "binomial")
#Show the logistic regression model
summary(default_model)
##
## Call:
## glm(formula = TARGET ~ DAYS_REGISTRATION + DAYS_BIRTH + WEEKDAY_APPR_PROCESS_START +
## HOUR_APPR_PROCESS_START, family = "binomial", data = .)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.974e-01 3.888e-02 -23.080 <2e-16 ***
## DAYS_REGISTRATION 1.979e-05 2.118e-06 9.346 <2e-16 ***
## DAYS_BIRTH 6.528e-05 1.665e-06 39.209 <2e-16 ***
## WEEKDAY_APPR_PROCESS_STARTMONDAY -5.976e-02 2.335e-02 -2.559 0.0105 *
## WEEKDAY_APPR_PROCESS_STARTSATURDAY -6.531e-02 2.604e-02 -2.508 0.0121 *
## WEEKDAY_APPR_PROCESS_STARTSUNDAY -8.411e-02 3.349e-02 -2.512 0.0120 *
## WEEKDAY_APPR_PROCESS_STARTTHURSDAY -7.859e-03 2.313e-02 -0.340 0.7340
## WEEKDAY_APPR_PROCESS_STARTTUESDAY 2.854e-02 2.263e-02 1.261 0.2071
## WEEKDAY_APPR_PROCESS_STARTWEDNESDAY 5.090e-04 2.294e-02 0.022 0.9823
## HOUR_APPR_PROCESS_START -3.448e-02 2.031e-03 -16.972 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 172542 on 307510 degrees of freedom
## Residual deviance: 170212 on 307501 degrees of freedom
## AIC: 170232
##
## Number of Fisher Scoring iterations: 5
This is a simple logistic regression model using only a few variables. We would want to build a more advanced model to gather further insights and see whether the p-values stay consistent.
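Two quantities help put the summary above in context: the odds ratio implied by a coefficient, and the share of deviance the model explains. The sketch below copies the numbers from the printed output. It assumes DAYS_BIRTH counts days backwards from the application date, so an increase of 365 corresponds to an applicant one year younger.

```r
# Values copied from the model summary above.
beta_days_birth <- 6.528e-05
null_dev        <- 172542
resid_dev       <- 170212

# Odds ratio for a 365-day increase in DAYS_BIRTH (assumed: one year younger).
or_per_year <- exp(beta_days_birth * 365)

# Share of the null deviance explained by the model.
dev_explained <- 1 - resid_dev / null_dev

round(or_per_year, 3)
round(dev_explained, 4)
```

The odds ratio comes out at about 1.024 per year and the explained deviance at about 1.35%. So although several p-values are very small, which is common with more than 300,000 rows, the model as a whole explains little of the variation in TARGET.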